Abstract

Classic statistical techniques (like the multi-dimensional likelihood and the Fisher discriminant method) together with Multi-layer Perceptron and Learning Vector Quantization Neural Networks have been systematically used in order to find the best sensitivity when searching for $\nu_\mu \to \nu_\tau$ oscillations. We found that, for a general direct $\nu_\tau$ appearance search based on kinematic criteria: a) An optimal discrimination power is obtained using only three variables ($E_{visible}$, $P_T^{miss}$ and $\rho_l$) and their correlations. Increasing the number of variables (or combinations of variables) only increases the complexity of the problem, but does not result in a significant change of the expected sensitivity. b) The multi-layer perceptron approach offers the best performance. As an example to support these points numerically, we have considered the problem of $\nu_\tau$ appearance at the CNGS beam using a Liquid Argon TPC detector.

1 Introduction



The experimental confirmation that atmospheric and solar neutrinos do oscillate [1, 2], and therefore have mass, represents the first solid clue for the existence of new physics beyond the Standard Model [3]. Results from experiments carried out with neutrinos produced in artificial sources, like reactors and accelerators, strongly support the fact that neutrinos are massive [4, 5].

Notwithstanding the impressive results achieved by current experiments, neutrino phenomenology is a very rich and active field, where plenty of open questions still await a definitive answer. Thus, many next-generation neutrino experiments are being designed and proposed to measure with precision the parameters that govern the oscillation (mass differences and mixing angles) [6]. New facilities like super-beams, beta beams [7] and neutrino factories [8] have been put forward and their performances studied in detail in order to ascertain whether they can give an answer to two fundamental questions: what is the value of the mixing angle between the first and the third family, and whether CP violation takes place in the leptonic sector.

Recently, the Super-Kamiokande Collaboration has reported first evidence of the sinusoidal behaviour of neutrino disappearance as dictated by neutrino oscillations [10]. However, although the most favoured hypothesis for the observed disappearance is that of $\nu_\mu \to \nu_\tau$ oscillations, no direct evidence for $\nu_\tau$ appearance exists to date. A long baseline neutrino beam, optimized for the parameters favoured by atmospheric oscillations, has been approved in Europe to look for explicit $\nu_\tau$ appearance: the CERN-Laboratori Nazionali del Gran Sasso (CNGS) beam [11]. The approved experimental program consists of two experiments, ICARUS [12] and OPERA [13], that will search for $\nu_\mu \to \nu_\tau$ oscillations using complementary techniques.

Given the previous experimental efforts [14, 15] and present interest in direct $\nu_\tau$ appearance, we assess in this note the performance of several statistical techniques applied to the search for $\nu_\tau$ using kinematic techniques. Classic statistical methods (like multi-dimensional likelihood and Fisher's discriminant schemes) and Neural Network based ones (like multi-layer perceptron and self-organized neural networks) have been applied in order to find the approach that offers the best sensitivity.

2 Oscillation Search Using Kinematic Criteria 

The original proposal to observe for the first time the direct appearance of a $\nu_\tau$ by means of
kinematic criteria dates back to 1978 [16]. Based on the capabilities to measure the direction of the
hadronic jet, the interaction of the neutrino associated with the tau lepton can be spotted thanks 
to: a) the presence of a sizable missing transverse momentum; b) certain angular correlations 
between the direction of the prompt lepton and the hadronic jet, in the plane transverse to the 
incoming neutrino beam direction. 

NOMAD [23] was a pioneering experiment in the use of kinematic criteria applied to a $\nu_\mu \to \nu_\tau$ oscillation search. The kinematic approach was validated after several years of successful operation at the CERN WANF neutrino beam. This short-baseline experiment set the most competitive limit for $\nu_\mu \to \nu_\tau$ oscillations at high values of $\Delta m^2$ (Hp.

An impressive background rejection power of $O(10^5)$ was needed in NOMAD. To achieve this, a
multidimensional likelihood was built taking advantage of: on the one hand, the different event 
kinematics for signal and background events; on the other, the existing correlations among the 
variables used. To further enhance the sensitivity, the signal region was divided into several bins. 

Given the interest that $\nu_\mu \to \nu_\tau$ oscillation searches have nowadays for the region of $\Delta m^2 \sim 10^{-3}$ eV$^2$, we have considered the problem of finding the statistical approach that offers the best
sensitivity for this kind of search. Unlike NOMAD, we do not try to improve the sensitivity by 
splitting the signal regions into a set of independent bins. 

We have simply compared the discrimination power offered by a multi-dimensional likelihood, 
the Fisher discriminant method and a neural network. As a general conclusion, we have observed 
that neural networks offer the best background rejection power thanks to their ability to find 
complex correlations among the kinematic variables. In addition, they allow us to reduce the complexity of the problem, given that a small number of input variables is enough to optimize the
experimental sensitivity. These conclusions are valid for direct $\nu_\tau$ appearance searches performed
either with atmospheric or accelerator neutrinos. In what follows we give a numerical example 
that illustrates the conclusions of this study. 

3 Detector Configuration and Data Simulation 

To obtain a numerical evaluation of the performances of the different statistical techniques we 
used, and assess which of them gives the best sensitivity when searching for direct $\nu_\tau$ appearance
by means of kinematic criteria, we have considered the particular case of the CNGS beam. 

We assume a detector configuration consisting of 3 ktons of Liquid Argon ^2]- I n our simulation 
the total mass of active (imaging) Argon amounts to 2.35 ktons. We assumed five years running of the CNGS beam in shared mode ($4.5 \times 10^{19}$ p.o.t. per year), which translates into a total exposure of $5 \times 2.35 = 11.75$ kton × year. The total event rates expected are 252 (17) $\nu_e$ ($\bar{\nu}_e$) CC events and 50 $\nu_\tau$ CC events with the $\tau$ decaying into an electron plus two neutrinos (we assume maximal mixing and $\Delta m^2_{23} = 3 \times 10^{-3}$ eV$^2$; these values are compatible with the allowed range given by atmospheric neutrinos). Before cuts, the signal over background ratio, in active LAr, is $50/252 \simeq 0.2$.

The study of the capabilities to reconstruct and analyze high-energy neutrino events was done 
using fully simulated $\nu_e$ CC events inside the whole LAr active volume. Neutrino cross sections
and the generation of neutrino interactions are based on the NUX code [20]; final state particles
are then tracked using the FLUKA package [21]. The angular and energy resolutions used in the
simulation of final state electrons and individual hadrons are identical to those quoted in ^2] . 

In order to apply the most efficient kinematic selection, it is mandatory to reconstruct with 
the best possible resolution the energy and the angle of the hadronic jet and the prompt lepton, 
with particular attention to the tails of the distributions. Therefore, the energy flow algorithm 
has been designed with care, taking into account the needs of the tau search analysis. 

The ability to look for tau appearance events is limited by the containment of high energy 
neutrino events. Energy leakage outside the active imaging volume creates tails in the kinematic 
variables that fake the presence of neutrinos in the final state. We therefore impose fiducial cuts 
in order to guarantee that on average the events will be sufficiently contained. 

The fiducial volume is defined by looking at the profiles of the total missing transverse mo- 
mentum and of the total visible energy of the events. The average value of these variables is a 
good estimator of how much energy is leaking on average. After fiducial cuts, we keep 65% of the
total number of events occurring in the active LAr volume. This means a total exposure of 7.6
kton × year after five years of shared CNGS running.
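
The exposure figures quoted above follow from simple arithmetic. The short sketch below reproduces them; the constant and function names are illustrative only, and the input values are the ones stated in the text.

```python
# Inputs taken from the text of this section.
ACTIVE_MASS_KTON = 2.35    # active (imaging) LAr mass
RUNNING_YEARS = 5          # shared-mode CNGS running
FIDUCIAL_FRACTION = 0.65   # fraction of events kept after fiducial cuts


def exposure_kton_year(mass_kton, years, fraction=1.0):
    """Exposure in kton x year, optionally scaled by a selection fraction."""
    return mass_kton * years * fraction


active = exposure_kton_year(ACTIVE_MASS_KTON, RUNNING_YEARS)
fiducial = exposure_kton_year(ACTIVE_MASS_KTON, RUNNING_YEARS, FIDUCIAL_FRACTION)
print(f"active: {active:.2f} kton x year, fiducial: {fiducial:.1f} kton x year")
```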

Table 1 summarizes the total amount of simulated data used for this study. We note that the
$\nu_\tau$ ($\nu_e$) CC sample, generated in active LAr, is more than a factor 250 (50) larger than the
expected number of collected events after five years of CNGS running. 



Process           $\nu_e$ CC      $\nu_\tau$ CC
Active LAr        14200 [252]     13900 [50]
Fiducial Vol.     9250 [163]      9000 [33]

Table 1: Amount of fully generated data in Active and Fiducial LAr volumes. Between brackets
we show the expected number of events after five years of data taking at CNGS with a 3 kton
detector.

4 Statistical Pattern Recognition Applied to Oscillation Searches



In the case of a $\nu_\mu \to \nu_\tau$ oscillation search with Liquid Argon, the golden channel to look for $\nu_\tau$ appearance is the decay of the tau into an electron and a neutrino-antineutrino pair, due to: (a) the excellent electron identification capabilities; (b) the low background level, since the intrinsic $\nu_e$ and $\bar{\nu}_e$ charged current contamination of the beam is at the level of one per cent.

Kinematic identification of the $\tau$ decay QH]>, which follows the $\nu_\tau$ CC interaction, requires excellent detector performance: good calorimetric features together with tracking and topology reconstruction capabilities. In order to separate $\nu_\tau$ events from the background, a basic criterion can be used: an unbalanced total transverse momentum due to neutrinos produced in the $\tau$ decay.

In figure 1 we illustrate the difference in kinematics for signal and background events. We plot
four of the most discriminating variables:

• $E_{vis}$: Visible energy.

• $P_T^{miss}$: Missing momentum in the transverse plane with respect to the direction of the
incident neutrino beam.

• $P_T^{lep}$: Transverse momentum of the prompt electron candidate.

• $\rho_l = \dfrac{P_T^{lep}}{P_T^{lep} + P_T^{had} + P_T^{miss}}$

Signal events tend to accumulate in low $E_{vis}$, low $P_T^{lep}$, low $\rho_l$ and high $P_T^{miss}$ regions.
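
As an illustration of how these variables are obtained from reconstructed quantities, the following minimal sketch computes them for one event. It is not the reconstruction code used in this study: the function name is hypothetical, the beam is assumed to point along the z axis, and particle masses are neglected so that energies are approximated by momentum magnitudes.

```python
import numpy as np


def kinematic_variables(p_lep, p_had):
    """Return (E_vis, PT_miss, PT_lep, rho_l) for one event.

    p_lep, p_had: 3-momenta (GeV) of the prompt lepton and of the hadronic
    jet, with the z axis along the incoming neutrino beam (assumption).
    Masses are neglected, so energies equal momentum magnitudes.
    """
    p_lep = np.asarray(p_lep, dtype=float)
    p_had = np.asarray(p_had, dtype=float)
    e_vis = np.linalg.norm(p_lep) + np.linalg.norm(p_had)
    pt_lep = np.linalg.norm(p_lep[:2])
    pt_had = np.linalg.norm(p_had[:2])
    # The missing transverse momentum balances the visible transverse momentum.
    pt_miss = np.linalg.norm(p_lep[:2] + p_had[:2])
    rho_l = pt_lep / (pt_lep + pt_had + pt_miss)
    return e_vis, pt_miss, pt_lep, rho_l
```

For a perfectly balanced event (lepton and jet back to back in the transverse plane) the missing transverse momentum vanishes and $\rho_l = 0.5$; undetected neutrinos from a $\tau$ decay push $P_T^{miss}$ up and $\rho_l$ down.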

Throughout this article, we take into account only the background due to electron neutrino charged current interactions. Due to the low $\bar{\nu}_e$ content of the beam, charged current interactions of this type have been observed to give a negligible contribution to the total expected background. We are confident that neutral current background can be reduced to a negligible level using LAr imaging capabilities and algorithms based on the different energy deposition shown by electrons and $\pi^0$ (see for example [22]). Therefore it will not be further considered. The contamination due to charm production and $\nu_\mu$ CC events, where the prompt muon is not identified as such, was studied by the ICARUS Collaboration [22] and shown to be less important than the $\nu_e$ CC background.



4.1 Oscillation Search Using Classic Statistical Methods 

4.1.1 The Multi-dimensional Likelihood 

The first method adopted for the $\tau$ appearance search is the construction of a multi-dimensional
likelihood function (see for example j2U), which is used as the unique discriminant between signal 
and background. This approach is, a priori, an optimal discrimination tool since it takes into 
account correlations between the chosen variables. 

A complete likelihood function should contain five variables (three providing information of the 
plane normal to the incident neutrino direction and two more providing longitudinal information) . 
However, in a first approximation, we limit ourselves to the discrimination information provided 
by the three following variables: $E_{visible}$, $P_T^{miss}$ and $\rho_l$.

As we will see later, all the discrimination power is contained in these variables, therefore we
can largely reduce the complexity of the problem without affecting the sensitivity of the search.
Two likelihood functions were built, one for $\tau$ signal ($\mathcal{L}_S$) and another for background events ($\mathcal{L}_B$).
The discrimination was obtained by taking the ratio of the two likelihoods:

$$\ln\lambda \equiv \mathcal{L}([E_{visible}, P_T^{miss}, \rho_l]) = \frac{\mathcal{L}_S([E_{visible}, P_T^{miss}, \rho_l])}{\mathcal{L}_B([E_{visible}, P_T^{miss}, \rho_l])} \qquad (1)$$

In order to avoid a bias in our estimation, half of the generated data was used to build the
likelihood functions and the other half was used to evaluate overall efficiencies. Full details about
the multi-dimensional likelihood algorithm can be found elsewhere [22]. However, we want to
point out here some important features of the method.

Figure 1: Visible energy (top left), transverse missing momentum (bottom left), transverse electron
momentum (top right) and $\rho_l$ (bottom right). Histograms have an arbitrary normalization.

Figure 2: Comparison of the "flat" $E_{visible}$, $P_T^{lep}$, $\rho_l$ and $P_T^{miss}$ variables between $\tau$ signal and $\nu_e$ CC
events. An arbitrary normalization has been used when plotting background events.

A partition of the hyperspace of input variables is required: the multi-dimensional likelihood
will be, in principle, defined over a lattice of bins. The number of bins to be filled when constructing
likelihood tables grows like $n^d$, where $n$ is the number of bins per variable and $d$ the number of
variables. This leads to a "dimensionality" problem when we increase the number of variables,
since the amount of data required to have a well-defined value for $\ln\lambda$ in each bin of the lattice
will grow exponentially.

In order to avoid regions populated with very few events, input variables must be redefined
to have the signal uniformly distributed in the whole input hyperspace; hence $E_{visible}$, $P_T^{miss}$ and
$\rho_l$ are replaced by "flat" variables (see figure 2). Besides, an adequate smoothing algorithm is
needed in order to alleviate fluctuations in the distributions in the hyperspace and also to provide
a continuous map from the input variables to the multi-dimensional likelihood one ($\ln\lambda$).

Ten bins per variable were used, giving rise to a total of $10^3$ bins. Figure 3 shows the likelihood
distributions for background and tau events assuming five years running of CNGS (total exposure 
of 7.6 kton x year for events occurring inside the fiducial volume). 
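
The construction just described (flat variables, a lattice of $n^d$ bins, and a likelihood ratio per bin) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the flattening uses the empirical cumulative distribution of the signal, and a constant pseudo-count per bin stands in for the smoothing algorithm, which is not the one used in this study.

```python
import numpy as np


def flatten(values, reference):
    """Map values to [0, 1] through the empirical CDF of `reference`,
    so that `reference` itself becomes uniformly distributed."""
    ref = np.sort(np.asarray(reference, dtype=float))
    return np.searchsorted(ref, values, side="right") / len(ref)


def build_ln_lambda(signal, background, n_bins=10, floor=0.5):
    """Return a function mapping flat events of shape (N, d) to ln(lambda).

    signal, background: arrays of shape (N, d) with values in [0, 1].
    floor: pseudo-count added to every bin, a crude stand-in for the
    smoothing step (assumption).
    """
    d = signal.shape[1]
    edges = [np.linspace(0.0, 1.0, n_bins + 1)] * d
    h_sig, _ = np.histogramdd(signal, bins=edges)
    h_bkg, _ = np.histogramdd(background, bins=edges)
    p_sig = (h_sig + floor) / (h_sig + floor).sum()
    p_bkg = (h_bkg + floor) / (h_bkg + floor).sum()
    table = np.log(p_sig / p_bkg)  # lattice with n_bins**d entries

    def ln_lambda(events):
        idx = np.clip((np.asarray(events) * n_bins).astype(int), 0, n_bins - 1)
        return table[tuple(idx.T)]

    return ln_lambda
```

With ten bins and three variables the table holds $10^3$ entries; each extra variable multiplies that number by ten, which is the dimensionality problem mentioned above.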

Table 2 shows, for different cuts on $\ln\lambda$, the expected number of tau and $\nu_e$ CC background
events. As a reference for future comparisons, we focus our attention on the cut $\ln\lambda > 1.8$. It gives a
signal selection efficiency around 25% (normalized to the total number of $\tau$ events in active LAr).
This $\tau$ efficiency corresponds to 12.9 signal events. For this cut, we expect $1.1 \pm 0.2$ background
events. After cuts are imposed, this approach predicts a S/B ratio of about 12.

Figure 3: Multi-dimensional likelihood distributions for $\nu_e$ CC and $\tau \to e$ events. The last bin
in signal includes the event overflow. Error bars in the $\nu_e$ CC + $\nu_\tau$ CC sample represent statistical
fluctuations in the expected profile measurements after 5 years of data taking with shared running
CNGS and a 3 kton detector configuration.

                      $\nu_\tau$ CC ($\tau \to e$)                    $\nu_\tau$ CC ($\tau \to e$)
Cuts                  Efficiency (%)      $\nu_e$ CC        $\Delta m^2 = 3 \times 10^{-3}$ eV$^2$
Initial               100                 252               50
Fiducial volume       65                  163               33
$\ln\lambda > 0.0$    48                  6.8 ± 0.5         24.0 ± 0.6
$\ln\lambda > 0.5$    42                  3.6 ± 0.3         20.8 ± 0.6
$\ln\lambda > 1.0$    36                  2.5 ± 0.3         18.0 ± 0.6
$\ln\lambda > 1.5$    30                  1.7 ± 0.2         15.2 ± 0.5
$\ln\lambda > 1.8$    25                  1.1 ± 0.2         12.9 ± 0.5
$\ln\lambda > 2.0$    23                  0.86 ± 0.16       11.7 ± 0.5
$\ln\lambda > 2.5$    16                  0.40 ± 0.12       8.1 ± 0.4
$\ln\lambda > 3.0$    10                  0.22 ± 0.08       5.2 ± 0.3
$\ln\lambda > 3.5$    7                   0.12 ± 0.06       3.3 ± 0.2

Table 2: Expected number of $\nu_e$ CC background and signal events in the $\tau \to e$ analysis. A multi-dimensional
likelihood function is used as the unique discriminant. Numbers are normalized to 5
years running of CNGS. Errors in the number of expected events are of statistical nature.

4.1.2 The Fisher Discriminant Method 

The Fisher discriminant method [2U is a standard statistical procedure that, starting from a large 
number of input variables, allows us to obtain a single variable that will efficiently distinguish 
among different hypotheses. As in the likelihood method, the Fisher discriminant will contain all 
the discrimination information. 

The Fisher approach tries to find a linear combination of the following kind

$$t(\{x_j\}) = a_0 + \sum_{j=1}^{n} a_j x_j$$

of an initial set of variables $\{x_j\}$ which maximizes

$$\frac{(\bar{t}_{sig} - \bar{t}_{bkg})^2}{\sigma^2_{sig} + \sigma^2_{bkg}} \qquad (2)$$

where $\bar{t}$ is the mean of the $t$ variable and $\sigma^2$ its variance. This last expression is nothing but a
measure, for the variable $t$, of how well separated signal and background are. Thus, by maximizing
(2) we find the optimal linear combination of initial variables that best discriminates signal from
background. The parameters $a_j$ which maximize (2) can be obtained analytically by (see |2l])

$$a_i = \sum_j W^{-1}_{ij} \left(\mu_j^{sig} - \mu_j^{bkg}\right) \qquad (3)$$

where $\mu_j^{sig}$ and $\mu_j^{bkg}$ are the means of the variable $x_j$ for signal and background respectively, and
$W = V_{sig} + V_{bkg}$, with $V$ being the covariance matrices.
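
Because the coefficients follow directly from means and covariance matrices, the whole method fits in a few lines. The sketch below is a generic NumPy illustration of the analytic solution, not the code used for this analysis; the function names are hypothetical.

```python
import numpy as np


def fisher_coefficients(signal, background):
    """Coefficients a = W^{-1} (mu_sig - mu_bkg), with W = V_sig + V_bkg.

    signal, background: arrays of shape (N, n_variables).
    """
    mu_sig = signal.mean(axis=0)
    mu_bkg = background.mean(axis=0)
    w = np.cov(signal, rowvar=False) + np.cov(background, rowvar=False)
    # Solve W a = (mu_sig - mu_bkg) instead of inverting W explicitly.
    return np.linalg.solve(w, mu_sig - mu_bkg)


def fisher_variable(events, coefficients, offset=0.0):
    """Linear discriminant t = a_0 + sum_j a_j x_j for each event."""
    return offset + events @ coefficients
```

Adding more input variables only enlarges the vectors and the matrix $W$; no additional Monte Carlo statistics or binning is required.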

A Fisher Function for v T Appearance Search 

From the distributions of kinematic variables for $\nu_\tau$ CC and $\nu_e$ CC, we can immediately construct
a Fisher function for a given set of variables. Initially we select the same set of variables we used
for the likelihood approach, namely: $E_{visible}$, $P_T^{miss}$ and $\rho_l$. We need only the vector of means and
covariance matrices in order to calculate the optimum Fisher variable (equation 3). Distributions
are shown in figure 4, where the usual normalization has been assumed. In table 3, values for
the expected number of signal and background events are shown as a function of the cut on the
Fisher discriminant. Since linear correlations among variables are taken into account, the Fisher
discriminant method offers results similar to the ones obtained using a multi-dimensional likelihood.

Figure 4: The Fisher discriminant variable. Error bars in the $\nu_e$ CC + $\nu_\tau$ CC sample represent statistical
fluctuations in the expected profile measurements after 5 years of data taking with shared
running CNGS and a 3 kton detector configuration.

Contrary to what happens with a multi-dimensional likelihood (where the increase in the
number of discriminating variables demands more Monte-Carlo data and is therefore an extremely
CPU-consuming process), the application of the Fisher method to a larger number of kinematic
variables is straightforward, since the main characteristic of the Fisher method is that the final
discriminant can be obtained algebraically from the initial distributions of kinematic variables.
For instance, a Fisher discriminant built out of 9 kinematic variables ($E_{vis}$, $P_T^{miss}$, $\rho_l$, $P_T^{lep}$, $E_{lep}$,
$\rho_m$, $Q_T$, $m_T$, $Q_{lep}$) (1) predicts for $12.9 \pm 0.3$ taus a background of $1.17 \pm 0.14$ $\nu_e$ CC events. We
conclude that, for the Fisher method, increasing the number of variables does not improve
the discrimination power we got with the set $E_{vis}$, $P_T^{miss}$, $\rho_l$ and therefore these three
variables are enough to perform an efficient $\tau$ appearance search.

(1) See [18] for a detailed explanation of the variables.

                      $\nu_\tau$ CC ($\tau \to e$)                    $\nu_\tau$ CC ($\tau \to e$)
Cuts                  Efficiency (%)      $\nu_e$ CC        $\Delta m^2 = 3 \times 10^{-3}$ eV$^2$
Initial               100                 252               50
Fiducial volume       65                  164               33
Fisher > 0.5          46                  6.9 ± 0.3         23.1 ± 0.4
Fisher > 0.0          33                  2.4 ± 0.2         16.6 ± 0.4
Fisher > 0.27         25                  1.15 ± 0.13       12.9 ± 0.3
Fisher > -0.5         20                  0.60 ± 0.10       10.2 ± 0.3
Fisher > -1.0         10                  0.14 ± 0.05       5.2 ± 0.2

Table 3: Expected number of $\nu_e$ CC background and signal events in the $\tau \to e$ analysis. A Fisher
variable is used as the unique discriminant. Numbers are normalized to 5 years running of CNGS.
Errors in the number of expected events are of statistical nature.

4.2 Oscillation Search Using Neural Networks 

In the context of signal vs background discrimination, neural networks arise as one of the most 
powerful tools. The crucial point that makes these algorithms so good is their ability to adapt 
themselves to the data by means of non-linear functions. 

Artificial Neural Networks have become a promising approach to many computational applications. They are a mature and well-founded computational technique able to learn the natural behaviour of a given data set, in order to give future predictions or take decisions about the system that the data represent (see j2U and [J6j for a complete introduction to neural networks). During the last decade, neural networks have been widely used to solve High Energy Physics problems (see for an introduction to neural network techniques and applications to HEP). Multi-layer perceptrons efficiently recognize signal features from an, a priori, dominant background environment (EH!, M).

We have evaluated the performance offered by neural networks when looking for $\nu_\mu \to \nu_\tau$
oscillations. As in the case of a multi-dimensional likelihood, a single valued function will be the
unique discriminant. This is obtained adjusting the free parameters of our neural network model 
by means of a training period. During this process, the neural network is taught to distinguish 
signal from background using a learning data sample. 

Two different neural networks models have been studied: the multi-layer perceptron and the 
learning vector quantization self-organized network. In the following, the results obtained with 
both methods are discussed. 

4.2.1 The Multi-layer Perceptron 

The multi-layer perceptron (MLP) function has a topology based on different layers of neurons
which connect input variables (the variables that define the problem, also called feature variables)
with the output unit (see figure 5). The value (or "state") of a neuron is a non-linear function
of a weighted sum over the values of all neurons in the previous layer plus a constant, called bias:

$$s_i^l = g\left(\sum_j w_{ij}^l \, s_j^{l-1} + b_i^l\right) \qquad (4)$$

where $s_i^l$ is the value of the neuron $i$ in layer $l$; $w_{ij}^l$ is the weight associated to the link between
neuron $i$ in layer $l$ and neuron $j$ in the previous layer ($l - 1$); $b_i^l$ is a bias defined in each neuron
and $g(x)$ is called the transfer function. The transfer function is used to regularize the neuron's
output to a bounded value between 0 and 1 (or between -1 and 1).

In a multi-layer perceptron, a non-linear function is used to obtain the discriminating variable. 
Therefore complex correlations among variables are taken into account, thus enhancing background 
rejection capabilities. 
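
The layer-by-layer evaluation of the neuron states can be sketched as below. This is a generic illustration and not the MLPfit implementation: the sigmoid is assumed as transfer function, and the 3-4-1 network at the end uses random, untrained parameters purely as a hypothetical example.

```python
import numpy as np


def sigmoid(x):
    """Transfer function g(x), bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))


def mlp_forward(x, weights, biases):
    """Propagate inputs through the network, layer by layer.

    weights[l] has shape (neurons in layer l, neurons in layer l-1);
    biases[l] has shape (neurons in layer l,).
    """
    state = np.asarray(x, dtype=float)
    for w, b in zip(weights, biases):
        # s_i^l = g(sum_j w_ij^l * s_j^(l-1) + b_i^l)
        state = sigmoid(w @ state + b)
    return state


# Hypothetical, untrained network: 3 inputs, 4 hidden neurons, 1 output.
rng = np.random.default_rng(1)
shapes = [(4, 3), (1, 4)]
weights = [rng.normal(size=shape) for shape in shapes]
biases = [rng.normal(size=shape[0]) for shape in shapes]
output = mlp_forward([0.2, 0.7, 0.1], weights, biases)
```

Training consists of adjusting `weights` and `biases` so that the output approaches 1 for signal and 0 for background on the learning sample.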

Figure 5: A general multi-layer perceptron diagram, showing layer ($l-1$) and layer $l$. The optimal non-linear function of input
variables ($x_i$) is constructed using a set of basic units called neurons. Each neuron has two free
parameters that must be adjusted minimizing an error function.

The construction of an MLP implies that several choices must be made a priori: number of
input variables, hidden layers, neurons per layer, number of epochs, etc. The size of the simulated
data set is also crucial in order to optimize the training algorithm performance. If the training 
sample is small, it is likely for the MLP to adjust itself extremely well to this particular data set, 
thus losing generalization power (when this occurs the MLP is over-learning the data). 

Multi-layer Perceptron for $\nu_\tau$ Appearance Search

As already mentioned in section 4.1.1, we fully define our tagging problem using five variables (three
in the transverse plane and two in the longitudinal direction), since they fully describe the event
kinematics, provided that we ignore the jet structure. Initially we build an MLP that contains
only three input variables, and in a later step we incorporate more variables to see how the
discrimination power is affected. The three chosen variables are $E_{visible}$, $P_T^{miss}$ and $\rho_l$. Our
choice is the same as the one used for the multi-dimensional likelihood approach. This allows us
to make a direct comparison of the sensitivities provided by the two methods.

The implementation of the multi-layer perceptron was done by means of the MLPfit package
[30], interfaced in PAW. Among the set of different neural network topologies that we studied, we
saw that the optimal one is made of two hidden layers with four neurons in the first hidden layer
and one in the second (see figure 6).

Simulated data was divided into three statistically independent subsets of 5000 events each
(consisting of 2500 signal events and an identical number of background events).

The MLP was trained with a first "learning" data sample. The second "test" data
set was used as a control sample to check that over-learning does not occur. Once the MLP is
set, the evaluation of final efficiencies is done using the third independent data sample (namely,
a factor 40 (75) larger than what is expected for background (signal) after five years of CNGS
running with a 3 kton detector).

Error curves during learning are shown in figure 7 for training and test samples. We see that
even after 450 epochs, over-learning does not take place. Final distributions in the multi-layer
perceptron discriminating variable can be seen in figure 8.

Figure 6: Chosen topology for the MLP. We feed a two-layered MLP (4 neurons in the first hidden layer and
1 in the second) with input variables $E_{visible}$, $P_T^{miss}$ and $\rho_l$. The output layer gives a value close to 1 for signal and close to 0 for background.



Figure 9 shows the number of signal and background events expected after 5 years of data taking as
a function of the cut in the MLP variable. In figure 10 we represent the probability of an event,
falling in a region of the input space characterized by MLP output > cut, to be a signal event (top
plot), and the statistical significance as a function of the MLP cut (bottom plot). Background
rejection has been optimized since a cut based on the MLP output variable can select regions of
complicated topology in the kinematic hyperspace, given that now complex correlations are taken
into account (see figure 11).

Selecting MLP > 0.91 (overall $\tau$ selection efficiency = 25%), the probability that an event
falling in this region is signal amounts to $\sim 0.95$. For 5 years of running CNGS and a 3 kton
detector, we expect a total amount of $12.9 \pm 0.5$ $\nu_\tau$ CC ($\tau \to e$) events and $0.66 \pm 0.14$ $\nu_e$ CC
events. Table 4 summarizes as a function of the applied MLP cut the expected number of signal
and background events.

If we compare the outcome of this approach with the one obtained in section 4.1.1, we see that
for the same $\tau$ selection efficiency, the multi-dimensional likelihood expects $1.1 \pm 0.2$ background
events. Therefore, for this particular cut, the MLP reduces the number of expected $\nu_e$ CC events
to 60% of the likelihood value, that is, a 40% reduction.
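
The comparison at equal signal efficiency can be checked with a few lines of arithmetic. The sketch below uses only the background expectations quoted in the text for the cut that keeps 12.9 signal events; the dictionary keys are illustrative labels.

```python
# Expected nu_e CC background at ~25% tau selection efficiency
# (12.9 signal events), as quoted in the text for each method.
background = {
    "likelihood": 1.1,
    "fisher": 1.15,
    "mlp": 0.66,
}
SIGNAL = 12.9

ratio = background["mlp"] / background["likelihood"]
reduction = 1.0 - ratio
for name, bkg in background.items():
    print(f"{name:10s} B = {bkg:4.2f}  S/B = {SIGNAL / bkg:5.1f}")
print(f"MLP / likelihood background ratio: {ratio:.2f} (reduction {reduction:.0%})")
```

The MLP background is about 0.6 times the likelihood one, so the number of surviving $\nu_e$ CC events drops by roughly 40% for the same signal yield.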

As we did for the Fisher method, we studied whether the sensitivity given by the MLP increases
when a larger number of input variables is used. Even though the number of complex correlations
among variables is larger, the change in the final sensitivity is negligible. Once again, all the
discrimination power is provided by $E_{visible}$, $P_T^{miss}$ and $\rho_l$. The surviving background cannot be
further reduced by increasing the dimensionality of the problem.

Since an increase in the number of input variables does not improve the discrimination power
of the multi-layer perceptron, we tried to enhance signal efficiency following a different approach:
optimizing the set of input variables by finding new linear combinations of the original ones (or
functions of them like squares, cubes, etc).

To this purpose, using the fast computation capabilities of the Fisher method, we can operate in 

Figure 7: Learning curves for the MLP. The neural network is trained for 450 epochs in order to
reach a stable minimum. The solid line represents the error on the training sample, the dashed line is
the error on the test sample. Both lines run almost parallel: no over-learning occurs.





                      $\nu_\tau$ CC ($\tau \to e$)                    $\nu_\tau$ CC ($\tau \to e$)
Cuts                  Efficiency (%)      $\nu_e$ CC        $\Delta m^2 = 3 \times 10^{-3}$ eV$^2$
Initial               100                 252               50
Fiducial volume       65                  164               33
MLP > 0.70            42                  4.0 ± 0.4         21.4 ± 0.6
MLP > 0.75            40                  3.0 ± 0.3         19.9 ± 0.6
MLP > 0.80            37                  2.1 ± 0.3         18.6 ± 0.5
MLP > 0.85            33                  1.5 ± 0.2         16.4 ± 0.5
MLP > 0.90            27                  0.76 ± 0.15       13.5 ± 0.5
MLP > 0.91            25                  0.66 ± 0.14       12.9 ± 0.5
MLP > 0.95            19                  0.28 ± 0.09       9.6 ± 0.4
MLP > 0.98            12                  0.09 ± 0.05       5.8 ± 0.3

Table 4: Expected number of background and signal events when a multi-layer perceptron function
is used as the unique discriminant. Numbers are normalized to 5 years running of CNGS. Errors
in the number of events expected are of statistical nature.

Figure 8: Multi-layer perceptron output for $\nu_\tau$ CC ($\tau \to e$) and $\nu_e$ CC events (3 kton detector, 5 years CNGS). We see how signal
events accumulate around 1 while background peaks at 0. Only statistical errors are plotted.

Figure 9: Number of signal and background events after 5 years of running CNGS as a function
of the MLP cut (3 kton detector). Shadowed zones correspond to statistical errors.

Figure 10: (Top) Probability that an event belonging to the region of input variable space characterized
by MLP > cut is a signal event. (Bottom) Statistical significance of signal events as a
function of the cut in MLP (3 kton detector, 5 years CNGS).

Figure 11: Kinematic variables before (left histograms) and after (right histograms) cuts are
applied based on the MLP output (3 kton detector, 5 years CNGS). We see how the MLP has learnt that signal events favour low
$E_{visible}$, high $P_T^{miss}$ and low $\rho_l$ values.

a systematic way in order to find the most relevant feature variables. Starting from an initial set of
input variables, the algorithm described in pjj tries to gather all the discriminant information in a
smaller set of optimized variables. These last variables are nothing but successive Fisher functions
of different combinations of the original ones. In order to allow not only linear transformations,
we can add non-linear functions of the kinematic variables as independent elements of the initial
set.

We performed an analysis similar to the one described in [HU, using 5 initial kinematic variables
($E_{vis}$, $P_T^{miss}$, $\rho_l$, $P_T^{lep}$ and $E_{lep}$) plus their cubes and their exponentials (in total 15 initial
variables). At the end, we chose a smaller subset of six optimized Fisher functions that we use
as input feature variables for a new multi-layer perceptron.
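
The enlarged initial set can be built mechanically from the five kinematic variables. The helper below is a hypothetical sketch of that step only (it does not reproduce the Fisher-based variable optimization); it assumes the inputs have been rescaled to order one beforehand so that the exponentials stay numerically well behaved.

```python
import numpy as np


def expand_features(x):
    """Append cubes and exponentials to the input variables.

    x: array of shape (N, 5); returns an array of shape (N, 15).
    Inputs are assumed to be rescaled to order one (assumption).
    """
    x = np.asarray(x, dtype=float)
    return np.hstack([x, x**3, np.exp(x)])
```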

The MLP analysis with six Fisher variables does not enhance the oscillation search sensitivity
that we obtained with the three usual variables $E_{vis}$, $P_T^{miss}$ and $\rho_l$. We therefore
conclude that neither increasing the number of feature variables nor using optimized linear
combinations of kinematic variables as input enhances the sensitivity provided by
the MLP.

The application of statistical techniques able to find complex correlations among the input 
variables is the only way to enhance background rejection capabilities. In this respect, neural 
networks are an optimal approach. 

4.2.2 Self Organizing Neural Networks: LVQ Network 

A self-organizing (SO) network operates differently from a multi-layer perceptron.
These networks have the ability to organize themselves according to the "natural structure" of the
data. They can learn to detect regularities and correlations in their input and adapt their future
response to that input accordingly. An SO network usually has, besides the input, only one layer of
neurons, called the competitive layer (see figure 12). Neurons in the competitive layer are able
to learn the structure of the data following a simple scheme called competitive self-organization
(see [23]), which "moves" the basic units (neurons) in the competitive layer in such a way that
they imitate the natural structure of the data.
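
The idea of neurons "moving" towards the data can be sketched with a simple winner-take-all rule. This is a minimal illustration under our own assumptions (toy two-cluster data, a linearly decaying learning rate, neurons initialized on randomly chosen events), not the implementation used for the results of this paper. The function names are hypothetical.

```python
import numpy as np

def competitive_fit(data, n_neurons, epochs=20, lr0=0.1, seed=0):
    """Winner-take-all learning: the neuron closest to each event moves towards it."""
    rng = np.random.default_rng(seed)
    # start the neurons on randomly chosen events
    neurons = data[rng.choice(len(data), n_neurons, replace=False)].copy()
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            lr = lr0 * (1.0 - step / n_steps)          # linearly decaying learning rate
            winner = np.argmin(np.linalg.norm(neurons - x, axis=1))
            neurons[winner] += lr * (x - neurons[winner])
            step += 1
    return neurons

def quantization_error(data, neurons):
    """Mean distance from each event to its closest neuron."""
    d = np.linalg.norm(data[:, None, :] - neurons[None, :, :], axis=2)
    return d.min(axis=1).mean()

# Toy data: two well separated clusters in a two-dimensional input space
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(-3.0, 0.5, size=(500, 2)),
                  rng.normal(+3.0, 0.5, size=(500, 2))])
neurons = competitive_fit(data, n_neurons=2)
```

No class labels enter the update, which is what makes this step unsupervised.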

Competitive self-organization is an unsupervised learning algorithm; however, for classification
purposes one can improve the algorithm with supervised learning in order to fine-tune the final
positions of the neurons in the competitive layer. This is called learning vector quantization (LVQ)
(for further details refer to jJUEZj). An important difference with respect to the multi-layer
perceptron approach is that in LVQ we always get a discrete classification, namely, an event is always
assigned to one of the classes. The only thing one can estimate is the degree of belief in the LVQ
choice.
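
The supervised fine-tuning can be illustrated with the basic LVQ1 rule: the winning neuron moves towards an event of its own class and away from an event of the other class. The sketch below is an assumption-laden toy (Gaussian data in three input variables, LVQ1 rather than any specific variant used in the references, hypothetical function names), meant only to show why the output is discrete.

```python
import numpy as np

def lvq1_fit(x, y, n_per_class, epochs=10, lr0=0.05, seed=0):
    """LVQ1: the winner moves towards events of its own class, away from the others."""
    rng = np.random.default_rng(seed)
    protos, labels = [], []
    for cls, n in n_per_class.items():
        members = x[y == cls]
        protos.append(members[rng.choice(len(members), n, replace=False)])
        labels += [cls] * n
    protos, labels = np.vstack(protos), np.array(labels)
    n_steps, step = epochs * len(x), 0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            lr = lr0 * (1.0 - step / n_steps)          # linearly decaying learning rate
            k = np.argmin(np.linalg.norm(protos - x[i], axis=1))
            sign = 1.0 if labels[k] == y[i] else -1.0
            protos[k] += sign * lr * (x[i] - protos[k])
            step += 1
    return protos, labels

def lvq_predict(x, protos, labels):
    """Discrete output: the class label of the nearest neuron."""
    d = np.linalg.norm(x[:, None, :] - protos[None, :, :], axis=2)
    return labels[d.argmin(axis=1)]

# Toy data: class 1 = background-like, class 2 = signal-like
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(-1.0, 1.0, size=(2500, 3)),
               rng.normal(+1.0, 1.0, size=(2500, 3))])
y = np.array([1] * 2500 + [2] * 2500)
protos, labels = lvq1_fit(x, y, n_per_class={1: 6, 2: 4})
accuracy = (lvq_predict(x, protos, labels) == y).mean()
```

`lvq_predict` can only return one of the class labels, in contrast with the continuous output of a multi-layer perceptron.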

LVQ Network for $\nu_\tau$ Appearance Search

We use once more $E_{visible}$, $P_T^{miss}$ and $\rho_l$ as discriminating variables in the input layer. An
LVQ network with 10 neurons has been trained with samples of 2500 events for both signal and
background. Given that, before any cut, a larger background sample is expected, we have chosen
an asymmetric configuration for the competitive layer. Out of 10 neurons, 6 were assigned to
recognize background events, and the rest were assigned to the signal class. After the neurons
are placed by the training procedure, the LVQ network is fed with a larger and statistically
independent data sample consisting of 6000 signal and background events. The output provided
by the network is plotted in figure 13. We see how events are classified in two independent classes:
signal-like events (labeled with 2) and background-like ones (labeled with 1). 68% of $\nu_\tau$ CC ($\tau \to e$)
events and 10% of $\nu_e$ CC events occurring in the fiducial volume are classified as signal-like events.
This means a $\tau$ efficiency around 45% with respect to the tau events generated in active LAr. For
the same $\tau$ efficiency, the multi-layer perceptron misclassified only around 8% of $\nu_e$ CC events.

Figure 12: Schematic diagram of the general topology for a self-organized neural network. Neurons
in the competitive layer are connected with each one of the input nodes.

Several additional tests have been performed with LVQ networks, by increasing the number
of input variables and/or the number of neurons in the competitive layer. However, we observed
no improvement in the separation capabilities. For instance, a topology with 16 feature neurons
in the competitive layer and 4 input variables (we add the transverse lepton momentum) leads to
exactly the same result.

The simple geometrical interpretation of this kind of neural network supports our statement
that the addition of new variables to the original set {$E_{visible}$, $P_T^{miss}$, $\rho_l$} does not enhance the
discrimination power: the bulk of signal and background events are not better separated when we
increase the dimensionality of the input space.

Combining MLP with LVQ 

We have seen that LVQ networks return a discrete output. The whole event sample is classified
by the LVQ in two classes: signal-like and background-like. We can use the classification of an
LVQ as a pre-classification for the MLP. A priori, it seems reasonable to expect an increase in the
oscillation search sensitivity if we combine the LVQ and MLP approaches. The aim is to evaluate
how much additional background rejection, from the contamination inside the signal-like sample,
can be obtained by means of an MLP.

We present in figure 14 the MLP output for events classified as signal-like by the LVQ network
(see figure 13). Applying a cut on the MLP output such that we get 12.9 signal events (our usual
reference point of 25% $\tau$ selection efficiency), we get 0.82 ± 0.19 background events, similar to what
was obtained with the MLP approach alone. This outcome shows that, contrary to
our a priori expectations, pre-classifying events with a learning vector quantization
neural network does not improve the discrimination capabilities of a multi-layer perceptron.

5 $\nu_\tau$ Discovery Potential

We have studied several pattern recognition techniques applied to the particular problem of searching
for $\nu_\mu \to \nu_\tau$ oscillations. Based on discovery criteria, similar to the ones proposed in for
statistical studies of prospective nature, we try to quantify how much the statistical relevance of
the $\tau$ signal varies depending on the statistical method used.

Figure 13: LVQ neural network separation capabilities. In competitive self-organized networks a
discrete decision is always issued: signal-like events are labeled with 2 and background-like with 1.

Figure 14: LVQ and MLP networks combined. Distributions are given in the continuous MLP
variable. Only events labeled by LVQ with 2 (signal-like) have been used for the analysis.

                   Multi-layer     Multi-dimensional
                   Perceptron      Likelihood

# Signal           12.9            12.9
# Background       0.66            1.1
$\alpha$ factor    0.86            1.01

Table 5: Number of signal and background events for the multi-layer perceptron and the
multi-dimensional likelihood approaches. Numbers are normalized to 5 years of data taking in shared
CNGS running mode and a 3 kton detector configuration. The last row displays the scale factor
$\alpha$ needed to compute the minimum exposure fulfilling the discovery criteria described in the text.

We define $\mu_s$ and $\mu_b$ as the average number of expected signal and background events,
respectively. With this notation, we impose two conditions to consider that a signal is statistically
significant:

1. We require that the probability for a background fluctuation, giving a number of events equal
to or larger than $\mu_s + \mu_b$, be smaller than $\epsilon$ (where $\epsilon$ is $5.733 \times 10^{-7}$, the usual $5\sigma$ criterion
applied for Gaussian distributions).

2. We also set the confidence level $(1 - \delta)$ at which the distribution of the total number of events,
with mean value $\mu_s + \mu_b$, fulfills the background fluctuation criterion stated above.

For instance, if $\delta$ is 0.10 and $\epsilon$ is $5.733 \times 10^{-7}$, we are imposing that 90% of the times we repeat
this experiment, we will observe a number of events which is $5\sigma$ or more above the background
expectation.

For all the statistical techniques used, we fix $\delta = 0.10$ and $\epsilon = 5.733 \times 10^{-7}$. In this way we can
compute the minimum number of events needed to establish that, in our particular example, a
direct $\nu_\mu \to \nu_\tau$ oscillation has been observed.

In table 5 we compare the number of signal and background events obtained for the multi-layer
perceptron and the multi-dimensional likelihood approaches after 5 years of data taking with a
3 kton detector. We also compare the minimum exposure needed in order to have a statistically
significant signal. The minimum exposure is expressed in terms of a scale factor $\alpha$, where $\alpha = 1$
means a total exposure of 11.75 kton×year. For the multi-layer perceptron approach ($\alpha = 0.86$),
a statistically significant signal can be obtained after a bit more than four years of data taking.
On the other hand, the multi-dimensional likelihood approach requires 5 full years of data taking.
Therefore, when applied to the physics quest for neutrino oscillations, neural network techniques
perform better than classic statistical methods.
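
The two discovery conditions and the scale factor can be turned into a short numerical check. The sketch below is one possible reading of the criteria, not the code used for this paper: it assumes Poisson statistics, finds the smallest event count whose background-only probability is below $\epsilon$, requires that count to be reached with probability $1 - \delta$ when signal is present, and scans the scale factor on a grid. Function names and the grid step are illustrative, and the inputs are the rounded values of table 5.

```python
import math

EPSILON = 5.733e-7   # background-fluctuation probability quoted in the text (5 sigma)
DELTA = 0.10         # 1 - DELTA is the required confidence level

def poisson_tail(n, mu):
    """P(N >= n) for a Poisson variable of mean mu."""
    term, cdf = math.exp(-mu), 0.0
    for k in range(n):
        cdf += term
        term *= mu / (k + 1)
    return 1.0 - cdf

def is_discovery(mu_s, mu_b):
    """Apply both conditions for given expected signal and background."""
    n_min = 0
    while poisson_tail(n_min, mu_b) >= EPSILON:      # condition 1: threshold count
        n_min += 1
    return poisson_tail(n_min, mu_s + mu_b) >= 1.0 - DELTA   # condition 2

def min_scale_factor(mu_s, mu_b, step=0.005, max_alpha=5.0):
    """Smallest grid value of alpha for which (alpha*mu_s, alpha*mu_b) passes."""
    alpha = step
    while alpha <= max_alpha:
        if is_discovery(alpha * mu_s, alpha * mu_b):
            return alpha
        alpha += step
    return None

alpha_mlp = min_scale_factor(12.9, 0.66)   # multi-layer perceptron inputs
alpha_lik = min_scale_factor(12.9, 1.1)    # multi-dimensional likelihood inputs
```

With these rounded inputs the scan gives values close to the tabulated scale factors; exact agreement is not expected because of the rounding and the grid step.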

6 Conclusions 

We have considered the general problem of $\nu_\mu \to \nu_\tau$ oscillation search based on kinematic criteria
to assess the performance of several statistical pattern recognition methods.
The study leads to two main conclusions:

• An optimal discrimination power is obtained using only the following variables: $E_{visible}$,
$P_T^{miss}$ and $\rho_l$ and their correlations. Increasing the number of variables (or combinations
of variables) only increases the complexity of the problem, but does not result in an appreciable
change of the expected sensitivity.

• Among the set of statistical methods considered, the multi-layer perceptron offers the best 
performance. 

As an example, we have considered the case of the CNGS beam and $\nu_\tau$ appearance search (for
the $\tau \to e$ decay channel) using a very massive (3 kton) Liquid Argon TPC detector. Figure 15
compares the discrimination capabilities of the multi-dimensional likelihood and multi-layer perceptron
approaches. We see that, for the low background region, the multi-layer perceptron gives the best
sensitivity. For instance, choosing a $\tau$ selection efficiency of 25% as a reference value, we expect
a total of 12.9 ± 0.5 $\nu_\tau$ CC ($\tau \to e$) signal events and 0.66 ± 0.14 $\nu_e$ CC background events. Compared to
the multi-dimensional likelihood prediction, this means a 40% reduction in the number of expected
background events (0.66 versus 1.1). Hence, using a multi-layer perceptron, four years of data taking
will suffice to get a statistically significant signal, while five years are needed when the search
approach is based on a multi-dimensional likelihood.




Figure 15: Multi-layer perceptron vs multi-dimensional likelihood. We assume a 7.6 kton×year
exposure. The shadowed area shows the statistical error.